{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture\n",
    "!pip install --progress-bar off poetry\n",
    "!pip install --progress-bar off git+https://github.com/oughtinc/ergo.git@e92c684d45ebebea0d6bb43172b284fc53d067fc"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import warnings\n",
    "warnings.filterwarnings(action=\"ignore\", category=FutureWarning)\n",
    "warnings.filterwarnings(action=\"ignore\", module=\"plotnine\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from datetime import datetime\n",
    "\n",
    "import ergo\n",
    "from ergo.platforms.metaculus.question import MetaculusQuestion, LinearQuestion, LogQuestion, ContinuousQuestion"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.set_option('display.max_rows', None)\n",
    "pd.set_option('display.max_columns', None)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Get questions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get all *open* questions on the specified subdomains.\n",
    "\n",
    "NOTES:\n",
    "1. log date questions are excluded because Ergo currently can't handle them\n",
    "2. questions that are in multiple subdomains will be included multiple times, once for each subdomain"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "subdomains = [\"www\", \"pandemic\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "qs = []\n",
    "\n",
    "for subdomain in subdomains:\n",
    "    metaculus = ergo.Metaculus(username=\"oughtpublic\", password=\"123456\", api_domain=subdomain)\n",
    "    qs = qs + metaculus.get_questions(question_status=\"open\", pages=99999, load_detail=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# For questions with open boundaries,\n",
    "# the undetailed version of these questions is missing\n",
    "# the probability above and below the question bounds, which we'd like to have.\n",
    "# So, fetch the full question data.\n",
    "for q in qs:\n",
    "    if getattr(q, \"low_open\", False) or getattr(q, \"high_open\", False):\n",
    "        q.refresh_question()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "exemplar_q_id = 3530"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "exemplar_q = metaculus.get_question(exemplar_q_id)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "qs_df = exemplar_q.to_dataframe(qs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Get question metadata"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Get all of the metadata already on the question"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get the field names from the question JSON from Metaculus:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "metaculus_json_fields = list(exemplar_q.data.keys())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get the property names from the MetaculusQuestion and ContinuousQuestion classes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def properties(some_class):\n",
    "    \"\"\"\n",
    "    Get all @properties of a class\n",
    "    \"\"\"\n",
    "    class_items = some_class.__dict__.items()\n",
    "    return [name for (name, value) in class_items if isinstance(value, property)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "question_properties = properties(MetaculusQuestion)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "continuous_question_properties = properties(ContinuousQuestion)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture\n",
    "q_fields = metaculus_json_fields + question_properties + continuous_question_properties\n",
    "\n",
    "simple_fields = [field for field in q_fields if type(getattr(exemplar_q, field)) in [bool, int, float, str, datetime]]\n",
    "\n",
    "# This property causes an exception for some reason.\n",
    "# Didn't seem worth investigating\n",
    "simple_fields.remove(\"question_range_width\")\n",
    "\n",
    "for field in simple_fields:\n",
    "    qs_df[field] = [getattr(q, field, None) for q in qs]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generate and add more metadata"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_p_outside(q):\n",
    "    if not hasattr(q, \"latest_community_percentiles\"):\n",
    "        return None\n",
    "\n",
    "    # q.latest_community_percentiles is a float for binary questions:\n",
    "    # https://github.com/oughtinc/ergo/pull/378\n",
    "    if type(q.latest_community_percentiles) == float:\n",
    "        return None\n",
    "    \n",
    "    return q.latest_community_percentiles[\"low\"] + (1 - q.latest_community_percentiles[\"high\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "metadata_columns = {\n",
    "    \"type\": lambda q: type(q).__name__,\n",
    "    \"subdomain\": lambda q: q.metaculus.api_domain,\n",
    "    \"num_boundaries_open\": lambda q: int(q.low_open) + int(q.high_open) if hasattr(q, \"low_open\") else None,\n",
    "    \"question_scale_low\": lambda q: q.scale.low if hasattr(q, \"scale\") else None,\n",
    "    \"question_scale_high\": lambda q: q.scale.high if hasattr(q, \"scale\") else None,\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for (name, fn) in metadata_columns.items():\n",
    "    qs_df[name] = [fn(q) for q in qs]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Select and reorder columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have these columns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "qs_df[qs_df[\"id\"] == exemplar_q_id].head(1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Select all the ones that might plausibly be useful, and put them in a reasonable order:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "qs_df = qs_df[[\n",
    "    \"id\",\n",
    "    \"title\",\n",
    "    \"question_url\",\n",
    "    \"subdomain\",\n",
    "    \"type\",\n",
    "    \"num_boundaries_open\",\n",
    "    \"low_open\",\n",
    "    \"high_open\",\n",
    "    \"p_outside\",\n",
    "    \"question_scale_low\",\n",
    "    \"question_scale_high\",\n",
    "    \"anon_prediction_count\",\n",
    "    \"last_activity_time\",\n",
    "    \"votes\",\n",
    "    \"comment_count\",\n",
    "    \"created_time\",\n",
    "    \"publish_time\",\n",
    "    \"close_time\",\n",
    "    \"resolve_time\",\n",
    "    \"author_name\",\n",
    "    \"last_read\",\n",
    "    \"has_predictions\",\n",
    "    \"activity\",\n",
    "    \"title_short\"\n",
    "]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "qs_df[qs_df[\"id\"] == exemplar_q_id].head(1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Explanations of some important fields\n",
    "- `low_open`: Is the lower boundary of the question open? (only applies to ContinuousQuestions)\n",
    "- `high_open`: Is the upper boundary of the question open? (only applies to ContinuousQuestions)\n",
    "- `p_outside`: How much of the total probability mass of the community prediction is outside the question range? (only applies to ContinuousQuestions)\n",
    "- `anon_prediction_count`: Seems to be a proxy for the number of predictions. See \"Data notes\" below.\n",
    "- `last_activity_time`: Seems to be a quick proxy for the time of the last prediction on the question. See \"Data notes\" below.\n",
    "- `comment_count`: How many comments have been left on this question?\n",
    "- `created_time`: When did the author of the question create it? (I think)\n",
    "- `publish_time`: When was the question published to all Metaculus users?\n",
    "- `close_time`: After what time are predictions on this question no longer allowed?\n",
    "- `resolve_time`: When can the question be resolved, i.e. when will the answer be known?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data notes:\n",
    "1. `anon_prediction_count` is the closest thing I could find to a count of number of predictions, but I'm not really sure how it relates to the number of predictions. In my testing:\n",
    "    1. It seems to always be the same as the length of `prediction_timeseries`\n",
    "    2. It seems to be correlated with something about the number of predictions shown. E.g.\n",
    "        1. it's 101 for this question where the community prediction is shown: https://www.metaculus.com/questions/3530/how-many-people-will-die-as-a-result-of-the-2019-novel-coronavirus-covid-19-before-2021/.\n",
    "        2. While it's 0 for this question where the community prediction is not shown yet: https://www.metaculus.com/questions/4614/when-will-directly-removing-carbon-dioxide-from-the-atmosphere-be-economically-feasible/ \n",
    "    3. I couldn't get it to increment. I tried:\n",
    "        1. making a new prediction with an account that had already predicted on the question\n",
    "        2. making a prediction with an account that had never predicted on that question before.\n",
    "2.`last_activity_time` seems like the most obvious easy proxy for when the most recent prediction was made. However, I'm not sure how reliable it is.\n",
    "    1. It did not update when I made a new prediction from an account that had already predicted on the question previously\n",
    "    2. It may update when people leave comments or at other times\n",
    "    3. Alternatively, we could use the last time from the `prediction_timeseries`, but that also doesn't seem to update every time someone makes a prediction\n",
    "3. To get the datetime of the last posted comment, I think we'd need to retrieve it from a separate API (prob at least 30 min of work, maybe more like hours), so I haven't tried\n",
    "4. Log date questions are excluded here because Ergo can't handle them yet."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# View data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Export as csv\n",
    "\n",
    "(For use when running locally in `ergo/notebooks`)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# qs_df.to_csv(\"../ergo/contrib/metac_qs_data/metac_qs_data.csv\", index=False, float_format='%.20f')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A version of this CSV is uploaded as a [Google Sheet](https://docs.google.com/spreadsheets/d/1Aii_IkUTiJH6t14n2lhwhu4PJJjlTz6X5vdEi5gPGa0/edit#gid=1305569144)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## View all questions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "qs_df"
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "cell_metadata_filter": "-all",
   "main_language": "python",
   "notebook_metadata_filter": "-all"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
